Stellar Data Analysis

Authors

Jayrup Nakawala (u2613621)

Yogi Patel (u2536809)

Jasmi Alasapuri (u2571395)

Query 2.1: Galactic Plane vs Halo

from pyspark.sql import SparkSession
from pyspark.sql.functions import * 
import pyspark.sql.functions as F
from pyspark.sql.window import Window
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Initialize Spark
spark = SparkSession.builder \
.appName("Member2_Galactic_Structure") \
.config("spark.driver.memory", "4g") \
.getOrCreate()

# 1. Load the Data
parquet_path = "../data/gaia_survey.parquet"

df = spark.read.parquet(parquet_path)


# describe the gaia_survey data
df.describe().show()
+-------+--------------------+--------------------+-------------------+--------------------+-------------------+-------------------+-------------------+------------------+------------------+-----------------+
|summary|           source_id|                  ra|                dec|            parallax|     parallax_error|               pmra|              pmdec|   phot_g_mean_mag|             bp_rp|     teff_gspphot|
+-------+--------------------+--------------------+-------------------+--------------------+-------------------+-------------------+-------------------+------------------+------------------+-----------------+
|  count|              844868|              844868|             844868|              844868|             844868|             844868|             844868|            844868|            821788|           719674|
|   mean|4.208655544660478E18|  220.37382801385456|-14.661386665029772|  0.5507125399244238|0.12661974095717216|-2.3739633600053764|-3.0263002891848507|17.404869147894882|1.5300547396294117| 4878.39501996751|
| stddev|1.770170248935192...|    84.6737241202139|  38.84441601829223|  0.7697707648724885|0.08883984995128962|  7.009197284004899|  6.990431755441338|1.4361723326971774|0.6075873126388783|1028.365434900732|
|    min|      42159399217024|2.620729015336566...| -89.92247730288476|1.868718209532133E-7|        0.007789557|-377.74180694687686| -565.3362997837297|         3.0238004|       -0.54805183|         2739.997|
|    max| 6917515734718201472|   359.9982847741013|  89.77689025849239|   75.56887569382528|           1.519277|  649.0319386167508|  340.6398385516719|         18.999998|          7.822592|         37489.88|
+-------+--------------------+--------------------+-------------------+--------------------+-------------------+-------------------+-------------------+------------------+------------------+-----------------+
# 2. Register as a Temp View 
# This allows us to use SQL commands on the dataframe 'df'
df.createOrReplaceTempView("gaia_source")

print(">>> Executing Query 2.1: Galactic Plane vs Halo...")

# 3. The Spark SQL Query
# We use a CTE (Common Table Expression) named 'CalculatedData' to do the math first,
# and then the main query does the aggregation. 
query = """
    WITH CalculatedData AS (
        SELECT 
            source_id,
            -- Calculate Total Motion (Hypotenuse of pmra and pmdec)
            SQRT(POW(pmra, 2) + POW(pmdec, 2)) AS total_motion,
            
            -- Define Region using the 'DEC Proxy' method
            CASE 
                WHEN ABS(dec) < 15 THEN 'Galactic Plane' 
                ELSE 'Galactic Halo' 
            END AS region
        FROM gaia_source
    )
    
    -- Final Aggregation
    SELECT 
        region,
        ROUND(AVG(total_motion), 2) AS avg_speed,
        ROUND(STDDEV(total_motion), 2) AS stddev_speed,
        COUNT(*) AS star_count
    FROM CalculatedData
    GROUP BY region
    ORDER BY avg_speed DESC
"""

# 4. Run the query
sql_results = spark.sql(query)

# 5. Show results
sql_results.show()
>>> Executing Query 2.1: Galactic Plane vs Halo...
+--------------+---------+------------+----------+
|        region|avg_speed|stddev_speed|star_count|
+--------------+---------+------------+----------+
| Galactic Halo|     7.03|        7.73|    691873|
|Galactic Plane|     6.96|        8.96|    152995|
+--------------+---------+------------+----------+

Query 2.1 Analysis

Objective & Methodology

The objective of this query was to characterise stellar kinematics by comparing the proper motion of stars in the dense Galactic Disk with that of stars in the Galactic Halo.

  • Metric: We calculated the “Total Proper Motion” (\(\mu\)) for each star by combining its two components: \(\mu = \sqrt{\texttt{pmra}^2 + \texttt{pmdec}^2}\).
  • Segmentation: Due to dataset constraints, we utilized Declination (dec) as a proxy for Galactic Latitude. We defined the “Galactic Plane” as the equatorial band (\(|dec| < 15^{\circ}\)) and the “Halo” as the high-latitude regions (\(|dec| \ge 15^{\circ}\)).
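The two rules above (total motion as a hypotenuse, and the DEC-proxy cut at 15 degrees) can be sketched in plain Python. This is an illustrative mirror of the SQL logic, not the Spark job itself, and the sample values are hypothetical:

```python
import math

def total_motion(pmra, pmdec):
    """Total proper motion (mas/yr): the hypotenuse of the two components."""
    return math.hypot(pmra, pmdec)

def region(dec):
    """DEC-proxy segmentation: |dec| < 15 deg -> Plane, otherwise Halo."""
    return "Galactic Plane" if abs(dec) < 15 else "Galactic Halo"

# Hypothetical star with pmra = 3.0, pmdec = 4.0 mas/yr at dec = -10 deg
print(total_motion(3.0, 4.0))  # 5.0
print(region(-10.0))           # Galactic Plane
print(region(42.0))            # Galactic Halo
```

Note that, as in the SQL `CASE`, a star sitting exactly on the 15-degree boundary falls into the Halo.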

Finding A: The “Missed Galaxy” Problem (Star Count)

  • The simple reason is that the Milky Way is tilted relative to the equatorial coordinate grid.
  • Our query only sees the flat strip across the middle (dec between -15 and +15 degrees). Because of that, our “flat strip” misses the biggest parts of the galaxy.

Finding B: The Velocity Variation

  • We expected the “Halo” to have a higher velocity than the “Plane”. Instead, the Disk has the larger spread (stddev 8.96 vs 7.73).

The simple reason is that distance changes how speed looks.

  • The Disk: Contains many stars that are close to Earth. Because they are close, their motions look dramatic and varied from our viewpoint.

  • The Halo: Stars are incredibly far away. Even if they are moving fast, their distance makes them all appear to move slowly and steadily, leading to a “lower” measurement.
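This distance effect follows from the standard relation between tangential velocity and proper motion, \(v_t = 4.74\,\mu d\) (with \(v_t\) in km/s, \(\mu\) in arcsec/yr, \(d\) in parsecs). A quick sketch with hypothetical velocities and distances shows why a genuinely faster halo star can still show a smaller proper motion:

```python
def proper_motion_mas(v_tan_km_s, dist_pc):
    """Invert v_t = 4.74 * mu * d: proper motion in mas/yr for a given
    tangential velocity (km/s) and distance (parsecs)."""
    return v_tan_km_s / (4.74 * dist_pc) * 1000.0

# Hypothetical nearby disk star: 30 km/s at 100 pc
near = proper_motion_mas(30.0, 100.0)     # ~63.3 mas/yr
# Hypothetical distant halo star: 200 km/s at 10 kpc
far = proper_motion_mas(200.0, 10_000.0)  # ~4.2 mas/yr
print(near > far)  # True: the slower-but-closer star *looks* faster
```

The halo star moves almost seven times faster through space, yet its proper motion on the sky is roughly fifteen times smaller.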

Visualization

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# 1. Take a random 1% sample (Crucial for performance)
pdf_subset = df.select("ra", "dec").sample(fraction=0.01, seed=42).toPandas()

# 2. Re-create the Region logic (using numpy to avoid errors)
pdf_subset['Region'] = pdf_subset['dec'].apply(lambda x: 'Galactic Plane' if np.abs(x) < 15 else 'Galactic Halo')

# 3. SET THE THEME: Dark Background for a "Space" look
plt.style.use('dark_background')
plt.figure(figsize=(12, 7))

# 4. Plot the "Halo" (Background Stars)
# We plot these first in cool blue so they look "distant"
sns.scatterplot(
    data=pdf_subset[pdf_subset['Region'] == 'Galactic Halo'], 
    x='ra', 
    y='dec', 
    color='cornflowerblue', 
    s=5,           # Small dots
    alpha=0.3,     # Faint transparency
    edgecolor=None,
    label='Galactic Halo (Sparse)'
)

# 5. Plot the "Plane" (The Disk)
# We plot these on top in bright Gold to represent the dense star field
sns.scatterplot(
    data=pdf_subset[pdf_subset['Region'] == 'Galactic Plane'], 
    x='ra', 
    y='dec', 
    color='#FFD700', # Gold color
    s=10,            # Slightly larger dots
    alpha=0.4,       # Brighter
    edgecolor=None,
    label='Galactic Plane (Dense)'
)

# Draw the cut-off lines
plt.axhline(15, color='white', linestyle='--', linewidth=1, alpha=0.5)
plt.axhline(-15, color='white', linestyle='--', linewidth=1, alpha=0.5)

# Add text labels on the graph
plt.text(180, 0, "Milky Way Disk\n(High Density)", color='orangered', 
         ha='center', va='center', fontsize=12, )

plt.text(180, 60, "Galactic Halo\n(Low Density)", color='cornflowerblue', 
         ha='center', va='center', fontsize=10)

# 7. Final Polish
plt.title("Spatial Structure: The 'Flat' Disk vs. The 'Round' Halo", fontsize=14, color='white')
plt.xlabel("Right Ascension (Longitude)", fontsize=12)
plt.ylabel("Declination (Latitude)", fontsize=12)
plt.legend(loc='upper right', facecolor='black', edgecolor='white')
plt.grid(False) # Turn off grid to look more like space

# Astronomers view the sky looking "up", so we invert the X-axis
plt.gca().invert_xaxis()

plt.show()

Query 2.2: Star Density Sky Map

print(">>> Executing Query 2.2: Star Density Sky Map")

query_density = """
    WITH DensityBins AS (
        SELECT 
            -- 1. Spatial Binning (The 'Grid')
            -- We divide by 2, floor it to remove decimals, then multiply by 2
            -- This snaps every star to the nearest even number grid line (0, 2, 4...)
            FLOOR(ra / 2) * 2 AS ra_bin,
            FLOOR(dec / 2) * 2 AS dec_bin,
            
            -- 2. Aggregation (Counting stars in that grid square)
            COUNT(*) AS star_count
        FROM gaia_source
        GROUP BY 1, 2  -- Group by the first two columns (ra_bin, dec_bin)
    ),
    
    RankedRegions AS (
        SELECT 
            *,
            -- Rank the bins from most populated (1) to least populated
            RANK() OVER (ORDER BY star_count DESC) as density_rank
        FROM DensityBins
    )
    
    -- 4. Final Result: Top 5 Densest Regions
    SELECT * FROM RankedRegions 
    WHERE density_rank <= 5
"""

# 3. Run and Show
sql_density_results = spark.sql(query_density)
sql_density_results.show()
>>> Executing Query 2.2: Star Density Sky Map
+------+-------+----------+------------+
|ra_bin|dec_bin|star_count|density_rank|
+------+-------+----------+------------+
|   270|    -30|      2135|           1|
|   272|    -30|      2096|           2|
|   268|    -30|      2036|           3|
|   272|    -28|      2011|           4|
|   274|    -28|      1762|           5|
+------+-------+----------+------------+
25/12/11 16:48:19 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
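The WindowExec warning is triggered because `RANK() OVER (ORDER BY ...)` has no `PARTITION BY`, so Spark shuffles every bin onto a single partition. With a few thousand bins this is harmless, but since only the top five rows are needed, the same result is also obtainable as a plain top-k without any global window. The idea, sketched in plain Python over the bin counts from the result table above (plus one hypothetical sparse bin):

```python
import heapq

# (ra_bin, dec_bin) -> star_count, taken from the result table above,
# plus one hypothetical sparse bin
bins = {(270, -30): 2135, (272, -30): 2096, (268, -30): 2036,
        (272, -28): 2011, (274, -28): 1762, (0, 88): 12}

# A top-k keeps only k candidates at a time; no global sort is required
top5 = heapq.nlargest(5, bins.items(), key=lambda kv: kv[1])
print(top5[0])  # ((270, -30), 2135)
```

In Spark itself the equivalent is simply `ORDER BY star_count DESC LIMIT 5` on the aggregated bins, which avoids the single-partition window entirely.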

Query 2.2 Analysis

Objectives and Methodologies

The goal of this query was to create a “Stellar Density Map” to identify the most populated regions of the sky.

  • Spatial Binning Strategy: We utilized a quantization approach to divide the continuous sky into a discrete grid. By applying FLOOR(coordinate / 2) * 2 to both Right Ascension (ra) and Declination (dec), we grouped stars into \(2^{\circ} \times 2^{\circ}\) spatial bins.
  • Analytical Complexity: To rank the bins by density, we incorporated a Spark SQL window function (RANK() OVER (ORDER BY star_count DESC)).
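The quantization step can be checked in isolation: FLOOR(x / 2) * 2 snaps any coordinate down to the lower edge of its 2-degree bin. A plain-Python mirror of the SQL expression:

```python
import math

def bin_2deg(coord):
    """Snap a coordinate to the lower edge of its 2-degree bin,
    mirroring SQL's FLOOR(coord / 2) * 2."""
    return math.floor(coord / 2) * 2

print(bin_2deg(271.3))  # 270
print(bin_2deg(270.0))  # 270
print(bin_2deg(-29.4))  # -30 (floor rounds toward -inf, not toward zero)
```

The behaviour on negative declinations matters: floor rounds toward negative infinity, so -29.4 lands in the [-30, -28) bin, which is exactly what the result table shows for the Galactic Centre region.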

Critical Analysis:

  1. The results provide a clean validation of the binning algorithm: the densest bins pinpoint the Galactic Centre.
  2. Astronomical Validation: The coordinates returned (\(RA \approx 270^{\circ}\), \(Dec \approx -30^{\circ}\)) correspond closely to the constellation Sagittarius.
  • The official coordinates of Sagittarius A* (the supermassive black hole at the center of the Milky Way) are \(RA \approx 266^{\circ}\), \(Dec \approx -29^{\circ}\).

Interpretation:

The high star count in these bins confirms that we are looking through the galactic plane and into the central bulge.

Visualization

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from pyspark.sql.functions import col

# 1. Reuse the "DensityBins" logic from Query 2.2, but save it as a real Dataframe
density_bins = spark.sql("""
    SELECT 
        FLOOR(ra / 2) * 2 AS ra_bin, 
        FLOOR(dec / 2) * 2 AS dec_bin,
        COUNT(*) as local_density
    FROM gaia_source 
    GROUP BY 1, 2
""")

# 2. Give every star a new property: "How many neighbors do I have?"
# We calculate the bins on the fly for the join
df_with_density = df.withColumn("ra_bin", floor(col("ra") / 2) * 2) \
                    .withColumn("dec_bin", floor(col("dec") / 2) * 2) \
                    .join(density_bins, ["ra_bin", "dec_bin"])

# 3. Sample 5% -- enough to look like "every star" to the human eye without crashing
pdf_visual = df_with_density.sample(fraction=0.05, seed=42).select("ra", "dec", "local_density").toPandas()

# 4. The "Glowing" Scatter Plot
plt.style.use('default')
plt.figure(figsize=(14, 8))

# We sort by density so the bright stars are plotted ON TOP of the dark ones
pdf_visual = pdf_visual.sort_values("local_density")

scatter = plt.scatter(
    pdf_visual['ra'], 
    pdf_visual['dec'], 
    c=pdf_visual['local_density'], # Color by density
    cmap='magma',                # Magma/Inferno = glowing fire effect
    s=2,                           # Tiny dots 
    alpha=0.8,                     # High opacity to make them pop
    edgecolors='none'              # No borders
)

# 5. Add a Color Bar (Legend)
cbar = plt.colorbar(scatter)
cbar.set_label('Stellar Density (Stars per bin)', rotation=270, labelpad=20, color='Black')
cbar.ax.yaxis.set_tick_params(color='Black')
plt.setp(plt.getp(cbar.ax.axes, 'yticklabels'), color='Black')

# 6. Styling
plt.title("The Milky Way: Stellar Density Visualization", fontsize=18, color='Black' )
plt.xlabel("Right Ascension", fontsize=12, color='Black')
plt.ylabel("Declination", fontsize=12, color='Black')
plt.gca().invert_xaxis() # Astronomical standard
plt.grid(False)

plt.show()

Query 2.3: Parallax Error vs Brightness

print(">>> Executing Query 2.3: Parallax Error vs. Brightness...")

query_quality = """
    WITH QualityMetrics AS (
        SELECT 
            -- 1. Create Brightness Bins (0-21)
            -- FLOOR groups '10.2' and '10.9' into bin '10'
            FLOOR(phot_g_mean_mag) AS mag_bin,
            
            -- Pass through the raw data we need
            parallax_error,
            parallax
        FROM gaia_source
        WHERE parallax > 0  -- Filter out bad data to avoid division by zero
    )
    
    -- 2. Aggregation to find the Error Trend
    SELECT 
        mag_bin AS magnitude,
        
        -- A. Count how many stars are in this brightness range
        COUNT(*) AS star_count,
        
        -- B. Average Absolute Error (Raw uncertainty)
        ROUND(AVG(parallax_error), 4) AS avg_raw_error,
        
        -- C. Average Relative Error (The "Percentage" uncertainty)
        -- Formula: Error / Total Signal
        ROUND(AVG(parallax_error / parallax), 4) AS avg_relative_error
        
    FROM QualityMetrics
    WHERE mag_bin > 0 AND mag_bin < 22 -- Focus on the valid main range
    GROUP BY mag_bin
    ORDER BY mag_bin ASC -- Sort from Bright -> Dim
"""

# 3. Run and Show
quality_results = spark.sql(query_quality)
quality_results.show(25) # Show 25 rows to see the full range
>>> Executing Query 2.3: Parallax Error vs. Brightness...
+---------+----------+-------------+------------------+
|magnitude|star_count|avg_raw_error|avg_relative_error|
+---------+----------+-------------+------------------+
|        3|         1|       0.1317|            0.0438|
|        4|         1|       0.0749|            0.0146|
|        5|         7|       0.0586|             0.023|
|        6|        27|       0.0426|            0.0125|
|        7|        61|       0.0357|             0.013|
|        8|       197|       0.0401|            0.0207|
|        9|       488|       0.0263|             0.017|
|       10|      1304|       0.0264|            0.0229|
|       11|      3007|       0.0265|            0.0299|
|       12|      7054|       0.0246|            0.0487|
|       13|     15489|       0.0239|            0.0655|
|       14|     32688|       0.0297|            0.2688|
|       15|     66266|       0.0411|            1.2044|
|       16|    125846|       0.0633|            0.4907|
|       17|    223654|        0.105|            1.4188|
|       18|    368778|       0.1929|            4.8654|
+---------+----------+-------------+------------------+

Query 2.3 Analysis

Objectives and Methodologies

The objective of this query was to evaluate the dependability of the parallax technique at various star magnitudes. Before entering the machine-learning stage, it is essential to identify the “Signal-to-Noise” ratios.

  • Binning Strategy: We grouped stars by their Apparent Magnitude (phot_g_mean_mag) into integer bins (e.g., Magnitude 10, 11, 12…).
  • For each bin we calculated:
    • Average Absolute Error: AVG(parallax_error)
    • Average Relative Error: AVG(parallax_error / parallax)
  • Expectations:
    • Bright stars (Mag < 13): high photon counts result in high-precision centroids, leading to low parallax error.
    • Dim stars (Mag > 13): lower signal-to-noise ratios should result in rapidly increasing errors.
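The distinction between the two metrics matters: an absolute error of a few hundredths of a milliarcsecond is excellent when the parallax signal is large, but catastrophic when the parallax itself is tiny. A sketch with hypothetical values in the same range as the result table:

```python
def relative_error(parallax_error, parallax):
    """Fractional parallax uncertainty: error divided by signal."""
    return parallax_error / parallax

# Hypothetical bright, nearby star: strong parallax signal
bright = relative_error(0.03, 2.0)   # 0.015 -> 1.5% uncertainty
# Hypothetical dim, distant star: error comparable to the signal itself
dim = relative_error(0.19, 0.05)     # 3.8 -> distance essentially unconstrained
print(bright, dim)
```

This is why the avg_relative_error column explodes at Mag > 14 even though avg_raw_error only grows modestly: the denominator (the parallax) shrinks faster than the numerator grows.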

Visualization

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# 1. Convert Spark Result to Pandas
pdf_quality = quality_results.toPandas()

# 2. Setup the Plot (Dual Axis)
# We want to show Star Count (Bars) AND Error Rate (Line) on the same chart
fig, ax1 = plt.subplots(figsize=(12, 6))

# 3. Plot A: Star Count (The Histogram) on Left Axis
# This shows where most of our data lives
sns.barplot(
    data=pdf_quality, 
    x='magnitude', 
    y='star_count', 
    color='cornflowerblue', 
    alpha=0.3, 
    ax=ax1,
    label='Star Count'
)
ax1.set_ylabel('Number of Stars (Log Scale)', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')
ax1.set_yscale('log') # Log scale because star counts vary wildly (1 to 300,000)

# 4. Plot B: The Parallax Error (AVG Error) on Right Axis
ax2 = ax1.twinx()

sns.lineplot(
    data=pdf_quality, 
    x=ax1.get_xticks(), # Align line with bars
    y='avg_raw_error',  # AVG(parallax_error)
    color='red', 
    marker='o', 
    linewidth=2,
    ax=ax2,
    label='Avg Parallax Error (mas)'
)
ax2.set_ylabel('Avg Parallax Error (milliarcseconds)', color='red')
ax2.tick_params(axis='y', labelcolor='red')

# 5. Titles and Layout
plt.title("Data Quality Audit (Error Rate vs. Brightness)", fontsize=14)
ax1.set_xlabel("Apparent Magnitude (Lower = Brighter)", fontsize=12)
plt.grid(True, linestyle=':', alpha=0.5)

plt.tight_layout()
plt.show()